CHARACTER CODING OF JAPANESE
******************************************************************************
* This archive contains the kanji font file KDP16SJ.FNT, which is needed *
* by the KDPLUS kanji preprocessor system. For those who would like to *
* know how the font file is organized, the following notes have been *
* provided which explain Japanese character coding. *
******************************************************************************
1) Starting point: the ku-ten table
All characters used in Japanese writing can be arranged in a table which is
called the "ku-ten" table. The table, which is universally used, is 94 columns
wide and 94 rows high, but rows 85 and up are empty (not used) at present.
Numbering of rows and columns starts at 1 (not zero). Any character can be
identified by specifying its row number (called its "ku" value) and its column
number (called its "ten" value).
The symbols in rows 1-47 are called "level 1 JIS (Japan Industrial Standard)
characters"; they are the most commonly used characters. Rows 48 and up are
called "level 2 JIS". The level 1 kanji (from row 16) are arranged according
to pronunciation (on-yomi normally) and stroke count.
A print-out of the "ku-ten" table can be found in the instruction manual of
every Japanese "wapro" (word-processor) and every Japanese printer. In many
"wapros" the ku-ten values of characters may be entered by hand. "Office
Automation Dictionaries", available in Japan, enable you to look up the "ku-
ten value" of any character.
The "ku-ten" table is not completely standardized in Japan. The
standardization applies only to rows 1-8 (kana, alphanumerics) and rows 16 and
up (kanji); they are defined in the standard JIS X 0208. Rows 9-15 are left
blank in the standard and can, apparently, be filled in by manufacturers
according to their own ideas. The blank areas in rows 1-8 are considered
"reserved".
The complete ku-ten table is contained in six files which go with this archive
(see section 6).
2) Kanji fonts
A kanji font is a set of binary data (a ROM chip, or a disk file) describing
the actual appearance of the symbols. The file KDP16SJ.FNT is an "almost"
standard 16 x 16 pixel kanji font (see section 7 for a summary of the changes
which were made). It contains bitmap images of characters, each bitmap 16
pixels wide and 16 pixels high; each bitmap therefore occupies 32 bytes.
The character bitmaps are arranged sequentially in the font file according to
the character's position in the ku-ten table. The offset (in bytes) of the
bitmap corresponding to character [ku,ten] is 32*((ku-1)*94+ten-1). The font
file contains bit-maps for the first 83 rows of the ku-ten table (row 85 and
up are empty anyhow, and row 84 contains only 5 rarely-used characters, so
this is no great loss). The total number of character images in the font is
thus 94*83=7802.
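The offset formula can be sketched in C as follows. The names glyph_offset
and font_size are only illustrative (they are not part of KDPLUS), and
font_size assumes that the file contains nothing but the 83 rows of bitmaps.

```c
#include <assert.h>

enum { KUTEN_COLUMNS = 94, FONT_ROWS = 83, GLYPH_BYTES = 32 };

/* Byte offset of the bitmap for character [ku, ten] in a 16 x 16 font
   laid out as described above. Both values are 1-based. */
long glyph_offset(int ku, int ten)
{
    assert(ku >= 1 && ku <= FONT_ROWS);
    assert(ten >= 1 && ten <= KUTEN_COLUMNS);
    return (long)GLYPH_BYTES * ((ku - 1) * KUTEN_COLUMNS + (ten - 1));
}

/* Size in bytes of a font holding 83 complete rows, assuming the file
   has no header or trailer (an assumption, not stated in these notes). */
long font_size(void)
{
    return (long)GLYPH_BYTES * KUTEN_COLUMNS * FONT_ROWS;
}
```

To read one character image, seek to glyph_offset(ku, ten) and read 32 bytes.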
The ku-ten table contains many gaps (incompletely filled rows). For instance,
in row 8 only the first 32 places are filled (with line draw symbols), the
rest is blank. Row 14 originally contained only 3 symbols (but now we have
added some IBM control characters to that row). The blank areas are left blank
in the font file; in other words, they are not skipped, but are represented by
bit-map tables which consist of zeroes. This is, of course, a waste of space,
but it makes for flexibility (you can put your own symbols there if you wish)
and easy decoding.
In the file KDP16SJ.FNT, the bitmap images in rows 9, 10, and 11 use only the
left-hand half of the 16 x 16 pixel box. They can be displayed with a
horizontal spacing of 8 pixels. 8-pixel, or half-character, symbols are called
hankaku; characters which use the full 16 x 16 box are called zenkaku. In a 24
x 24 pixel font, zenkaku characters are 24 pixels wide and hankaku characters
are 12 pixels wide. (In the font KDP24SJ.FNT, used by KPLJ24, the hankaku
characters are in fact 13 pixels wide; KPLJ24 inserts 2 empty pixels between
zenkaku characters to keep the zenkaku spacing twice the size of the hankaku
spacing.)
3) JIS coding
The number of columns in the ku-ten table, 94, is not arbitrary; it is derived
from the number of 7-bit ASCII characters. With 7 bits, 128 different
characters can be represented; leaving out the characters 0 and 127, and also
the characters 1-32 (control characters and space), we are left with 94
printable characters, having the numerical values 33-126.
Any character in the ku-ten table can now be represented by 2 bytes:
    first byte : "ku" value + 32
    second byte: "ten" value + 32
The first character in the ku-ten table, [ku=1, ten=1] is thus represented by
the two bytes [33,33]. The first kanji character in the table (the character
with pronunciation "A", meaning Asia), with ku=16 and ten=1, would be
represented by the bytes [48,33], or, in ASCII, "0!".
Thus we have a system of transmitting Japanese characters on channels which
use 7-bit characters (especially mainframe systems). This is called the JIS
code.
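As a minimal sketch, the conversion between ku-ten values and JIS bytes looks
like this in C. The function names are illustrative only.

```c
#include <assert.h>

/* Convert a 1-based [ku, ten] pair to the two 7-bit JIS bytes. */
void kuten_to_jis(int ku, int ten, unsigned char jis[2])
{
    assert(ku >= 1 && ku <= 94 && ten >= 1 && ten <= 94);
    jis[0] = (unsigned char)(ku + 32);
    jis[1] = (unsigned char)(ten + 32);
}

/* Inverse: recover [ku, ten] from two JIS bytes in the range 33-126. */
void jis_to_kuten(const unsigned char jis[2], int *ku, int *ten)
{
    assert(jis[0] >= 33 && jis[0] <= 126);
    assert(jis[1] >= 33 && jis[1] <= 126);
    *ku = jis[0] - 32;
    *ten = jis[1] - 32;
}
```

For the first kanji, kuten_to_jis(16, 1, b) gives the bytes 48 and 33, the
"0!" of the example above.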
The problem which now arises is this: a terminal capable of receiving kanji
data according to the system described above would interpret each character as
one half of a kanji. It could not receive normal ASCII text without changing
it into some garbled mess of kanji and kana. It would, of course, be desirable
if the same terminal could interpret ASCII characters according to their
normal meaning ALSO. The solution which was adopted for this may be inelegant,
but is unavoidable within the limitations of the 7-bit format. It consists of
switching between two modes: "ASCII mode" and "kanji mode". The mode is
switched by means of an escape sequence. JIS code systems need two escape
sequences:
    kanji in (KI) sequence : changes from ASCII mode to kanji mode
    kanji out (KO) sequence: changes from kanji mode to ASCII mode
Of course, the disadvantage of this method is that the KI and KO strings may
become garbled in transmission, leaving the system in the wrong mode. But I
suppose a better solution wasn't possible in systems using only seven bits.
KI and KO strings differ, according to the "dialect" of the JIS code which is
in use. Three major dialects are "old JIS", "new JIS", and "NEC", which have
respectively:
               KI         KO
               =======    =======
    old JIS    ESC $ @    ESC ( H
    new JIS    ESC $ B    ESC ( J
    NEC        ESC K      ESC H (pica), ESC E (elite)
"Old JIS" is, for instance, used by JICST and the Nikkei Telecom News database
service. "New JIS" is used by the kanji editor program MOKE (by Mark Edwards),
and in the Japanese section of the GENIE network. NEC printers use the NEC
code.
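The mode switching can be sketched as a small scanner. This version handles
only the "new JIS" strings from the table above (ESC $ B and ESC ( J); the
function name and the idea of collecting the ku-ten pairs in an array are
illustrative assumptions, not a description of how KDPLUS does it.

```c
#include <assert.h>
#include <stddef.h>

#define ESC 0x1B

/* Scan a "new JIS" byte stream and store the [ku, ten] pairs of the
   two-byte characters found in kanji mode. ASCII-mode bytes are skipped.
   Returns the number of pairs stored (at most max_pairs). */
size_t scan_new_jis(const unsigned char *s, size_t n,
                    int kuten[][2], size_t max_pairs)
{
    int kanji_mode = 0;              /* the default mode is ASCII */
    size_t i = 0, count = 0;

    assert(s != NULL && kuten != NULL);
    while (i < n) {
        if (s[i] == ESC && i + 2 < n) {
            if (s[i + 1] == '$' && s[i + 2] == 'B') {   /* KI */
                kanji_mode = 1;
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && s[i + 2] == 'J') {   /* KO */
                kanji_mode = 0;
                i += 3;
                continue;
            }
        }
        if (kanji_mode && i + 1 < n &&
            s[i] >= 33 && s[i] <= 126 &&
            s[i + 1] >= 33 && s[i + 1] <= 126) {
            if (count < max_pairs) {
                kuten[count][0] = s[i] - 32;        /* ku  */
                kuten[count][1] = s[i + 1] - 32;    /* ten */
                count++;
            }
            i += 2;                  /* consumed one two-byte character */
        } else {
            i++;                     /* ASCII byte, or a stray byte */
        }
    }
    return count;
}
```

A scanner for "old JIS" or NEC data would differ only in the strings it
compares against.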
Some JIS systems can also handle hankaku katakana characters. These characters
are encoded by one byte, with a value in the range 21-5F hex. To indicate that
such codes must be interpreted as hankaku katakana rather than normal ASCII,
hankaku katakana strings must be preceded and followed by special codes:
    the character SO (0E hex) switches from ASCII to hankaku katakana;
    the character SI (0F hex) switches from hankaku katakana to ASCII.
This system is used to communicate with the 7-bit, "old JIS" data-bank JICST.
You initiate a search by typing a keyword in ASCII or hankaku katakana
(JICST does not accept zenkaku characters for input). The response from the
system is in ASCII and "old JIS" zenkaku characters.
The default mode for JIS systems is ASCII mode.
4) EUC coding
EUC (Extended Unix Code) is a variant of JIS which is used on eight-bit UNIX
systems such as can be found in university environments. The coding system is
exactly the same as JIS, but the switch between ASCII mode and Kanji mode is
not indicated by escape strings. Instead, characters in kanji sequences have
the high bit set, while ASCII characters have the high bit cleared (zero).
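Taking the description above at face value, an EUC kanji is simply the JIS
byte pair with the high bit set in both bytes. A minimal sketch (function
names are illustrative; any other features of EUC are not covered here):

```c
#include <assert.h>

/* Convert a 1-based [ku, ten] pair to two EUC bytes: the JIS bytes
   (value + 32) with the high bit set. */
void kuten_to_euc(int ku, int ten, unsigned char euc[2])
{
    assert(ku >= 1 && ku <= 94 && ten >= 1 && ten <= 94);
    euc[0] = (unsigned char)((ku + 32) | 0x80);
    euc[1] = (unsigned char)((ten + 32) | 0x80);
}

/* Inverse: clear the high bit and subtract 32. */
void euc_to_kuten(const unsigned char euc[2], int *ku, int *ten)
{
    assert(euc[0] >= 0xA1 && euc[0] <= 0xFE);
    assert(euc[1] >= 0xA1 && euc[1] <= 0xFE);
    *ku = (euc[0] & 0x7F) - 32;
    *ten = (euc[1] & 0x7F) - 32;
}
```

A reader of EUC text can therefore decide byte by byte: a byte below 80 hex
is ASCII, a byte with the high bit set is one half of a kanji.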
5) SJIS coding
In bulletin board systems (which are always 8-bit), and frequently also for
internal character representation in Japanese personal computers, the so-called
SJIS code is used. SJIS means shift-JIS, probably to indicate that "shifted"
(high bit set) characters are used. They are used, however, in a way which is
very different from that of the EUC system.
There are three kinds of SJIS codes: controls, one-byte characters, and two-
byte characters.
Controls are represented by one byte, having the values 0-1f hex, or 0-31
decimal. Controls include codes for new line, carriage return, form feed, back
space, etc.
One byte characters are represented by one byte having a value ranging from 20
to 7E hex (32 to 126 decimal) or from A0 to DF hex (160 to 223 decimal). For
values in the range 20 to 7E hex, the meaning of the characters is the same as
in standard ASCII. The range A1 to DF hex is used for hankaku katakana; these
values are the same as the JIS hankaku katakana, but with the high bit set. On
the IBM PC, this range is occupied by the "box draw" characters. The value
A0 hex represents a space (same as 20 hex).
A peculiarity is that on some systems (for instance the KDPLUS system) the
one-byte characters can also be coded with two bytes; this is the case when
the characters have been put somewhere in the non-standardized part of the ku-
ten table, so that they have a normal two-byte address. On some systems (an
example is the Ichitaro word-processing system on an AX) ASCII and hankaku
katakana are kept out of the ku-ten table altogether, so these characters can
only be selected with one-byte codes.
Two-byte characters are represented by a "high" byte followed by a "low" byte.
In order not to be mistaken for a control or a one-byte character, the "high"
byte must use values which are not used by those characters, in the ranges 81-
9F hex and E0-EA hex. The "low" byte uses values in the range 40-FC hex, but
the value 7F hex is skipped (not used). This may be a relic from the paper
tape era. On paper tape systems, "all holes punched" was never used for a
character, so that it was possible to erase characters on the tape by
overpunching them.
There are 188 possible values for the "low" byte and 42 for the "high" byte.
Every possible value of the "high" byte can now encode 2 rows (2 x 94
characters) of the "ku-ten" table. In total therefore, 84 rows could be
encoded, but only one row is encoded for the characters with "high byte" equal
to EA hex.
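The byte ranges given above can be written down as a classifier, which also
makes it easy to check the counts of 42 and 188. This is only a sketch: the
names are illustrative, and values not mentioned above (7F, 80, EB-FF) are
lumped together as "other".

```c
#include <assert.h>

enum sjis_kind { SJIS_CONTROL, SJIS_ONE_BYTE, SJIS_HIGH_BYTE, SJIS_OTHER };

/* Classify a byte that is expected to start a new SJIS character. */
enum sjis_kind sjis_classify(unsigned char ch)
{
    if (ch < 0x20)
        return SJIS_CONTROL;
    if ((ch >= 0x20 && ch <= 0x7E) || (ch >= 0xA0 && ch <= 0xDF))
        return SJIS_ONE_BYTE;        /* ASCII, hankaku katakana, space */
    if ((ch >= 0x81 && ch <= 0x9F) || (ch >= 0xE0 && ch <= 0xEA))
        return SJIS_HIGH_BYTE;       /* first half of a two-byte character */
    return SJIS_OTHER;               /* not assigned in the description above */
}

/* Nonzero if ch is a valid "low" byte: 40-FC hex, with 7F skipped. */
int sjis_is_low_byte(unsigned char ch)
{
    return ch >= 0x40 && ch <= 0xFC && ch != 0x7F;
}

/* Check the counts quoted in the text by brute force. */
void sjis_check_counts(void)
{
    int c, high = 0, low = 0;

    for (c = 0; c < 256; c++) {
        if (sjis_classify((unsigned char)c) == SJIS_HIGH_BYTE) high++;
        if (sjis_is_low_byte((unsigned char)c)) low++;
    }
    assert(high == 42);              /* 31 values in 81-9F, 11 in E0-EA */
    assert(low == 188);              /* 2 x 94 */
}
```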
The algorithm for converting "ku-ten" values to "high-low" values is:
    high = 0x80 + (ku + 1) / 2;     /* 2 ku values share the same high byte */
    if (high > 0x9F) high += 0x40;  /* if outside 81-9F range, lift to E0-EA range */
    if (ku & 1) {                   /* ku is odd */
        low = 0x3F + ten;
        if (low >= 0x7F) low++;     /* skip the unused value 7F */
    }
    else low = 0x9E + ten;          /* ku is even */
The decoding algorithm is equally straightforward: assume that we have already
determined that a two-byte character has been sent, and we have the "high" and
"low" bytes available. We calculate the "ku" and "ten" values as follows:
    if (high >= 0xE0) high -= 0x40;
    high -= 0x80;
    ku = 2 * high - 1;              /* always produces an odd value */
    if (low > 0x9E) {               /* ku is even: increase the value */
        ku++;
        ten = low - 0x9E;
    }
    else {                          /* ku is odd */
        if (low > 0x7F) low--;
        ten = low - 0x3F;
    }
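The two fragments can be wrapped in functions and checked against each other.
The sketch below does this for all 84 rows that the SJIS scheme can address;
the function names are illustrative only.

```c
#include <assert.h>

/* Encode a 1-based [ku, ten] pair as SJIS "high" and "low" bytes. */
void kuten_to_sjis(int ku, int ten, int *high, int *low)
{
    assert(ku >= 1 && ku <= 84 && ten >= 1 && ten <= 94);
    *high = 0x80 + (ku + 1) / 2;
    if (*high > 0x9F) *high += 0x40;
    if (ku & 1) {                       /* ku is odd */
        *low = 0x3F + ten;
        if (*low >= 0x7F) (*low)++;     /* skip 7F */
    } else {                            /* ku is even */
        *low = 0x9E + ten;
    }
}

/* Decode SJIS "high" and "low" bytes back to [ku, ten]. */
void sjis_to_kuten(int high, int low, int *ku, int *ten)
{
    if (high >= 0xE0) high -= 0x40;
    high -= 0x80;
    *ku = 2 * high - 1;
    if (low > 0x9E) {                   /* ku is even */
        (*ku)++;
        *ten = low - 0x9E;
    } else {                            /* ku is odd */
        if (low > 0x7F) low--;
        *ten = low - 0x3F;
    }
}

/* Returns 1 if every [ku, ten] survives an encode/decode round trip
   and every low byte lies in 40-FC hex without being 7F. */
int sjis_round_trip_ok(void)
{
    int ku, ten, high, low, ku2, ten2;

    for (ku = 1; ku <= 84; ku++) {
        for (ten = 1; ten <= 94; ten++) {
            kuten_to_sjis(ku, ten, &high, &low);
            if (low < 0x40 || low > 0xFC || low == 0x7F) return 0;
            sjis_to_kuten(high, low, &ku2, &ten2);
            if (ku2 != ku || ten2 != ten) return 0;
        }
    }
    return 1;
}
```

For example, the first kanji [16,1] is encoded as high=88 hex, low=9F hex.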
The treatment of the one-byte characters depends on where the hankaku
characters are stored in the font, because this is hardly standardized. In the
font KDP16SJ.FNT, the hankaku ASCII characters are stored in row 9, and
hankaku katakana in row 10. So we calculate "ku" and "ten" as follows:
    if (ch < 0x20) {                            /* control character */
        /* ....put appropriate code here.... */
    }
    else if ((ch == 0x20) || (ch == 0xA0)) {    /* hankaku space */
        ku = 11;
        ten = 1;
    }
        /* The separate treatment of the hankaku space
           is necessary, because, inconveniently, the
           hankaku ASCII row in the font file does not
           start with a space, but with the exclamation
           mark (ASCII 0x21). We get the space from row
           11, which does start with a space. */
    else if ((ch > 0x20) && (ch <= 0x7E)) {     /* ASCII */
        ku = 9;
        ten = ch - 0x20;
    }
    else if ((ch > 0xA0) && (ch <= 0xDF)) {     /* hankaku katakana */
        ku = 10;
        ten = ch - 0xA0;
    }
    else {                                      /* not a one-byte character, but
                                                   first half of two-byte character. */
        /* ....put appropriate code here.... */
    }
Of course many tricks can be applied to make the code more compact and faster.
The separate treatment of the hankaku space can also be avoided with a small
trick. The above explanation shows the principle, however.
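For reference, the printable one-byte cases can be gathered into a single
function, as sketched below. The function name and the convention of
returning 0 for "not a printable one-byte character" are illustrative
assumptions; the row numbers are those of KDP16SJ.FNT as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Map a printable one-byte SJIS character to its place in KDP16SJ.FNT.
   Returns 1 and sets *ku and *ten on success. Returns 0 for controls
   and for bytes that start a two-byte character, which the caller must
   handle separately. */
int one_byte_to_kuten(unsigned char ch, int *ku, int *ten)
{
    assert(ku != NULL && ten != NULL);
    if (ch == 0x20 || ch == 0xA0) {     /* hankaku space, taken from row 11 */
        *ku = 11;
        *ten = 1;
        return 1;
    }
    if (ch > 0x20 && ch <= 0x7E) {      /* hankaku ASCII, row 9 */
        *ku = 9;
        *ten = ch - 0x20;
        return 1;
    }
    if (ch > 0xA0 && ch <= 0xDF) {      /* hankaku katakana, row 10 */
        *ku = 10;
        *ten = ch - 0xA0;
        return 1;
    }
    return 0;
}
```

The backslash (5C hex), for instance, comes out as ku=9, ten=60.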
It is quite easy to make your program recognize KI and KO strings, and switch
automatically between SJIS and JIS coding. It is not so easy to distinguish
automatically between SJIS and EUC (at least not on the basis of single
characters).
6) "Ku-ten" table files
You can obtain a print-out of the ku-ten tables for Level 1 and Level 2 JIS
by printing the files:
level1.1
level1.2
level1.3
(for Level 1 JIS)
level2.1
level2.2
level2.3
(for Level 2 JIS)
Because the ku-ten tables are too wide to be printed on one sheet, they have
been split into three parts, covering the columns 1-32, 33-64, and 65-94
respectively. You can print all three of them on a Japanese printer or word
processing system, or on a "Western" printer using the print utilities of the
KDPLUS system. Glue the tables together to get complete ku-ten tables.
The tables are SJIS coded. To convert to JIS, use the KDPLUS SJIS2JIS utility.
7) Changes made in KDP16SJ.FNT
A few changes have been made in KDP16SJ.FNT to adapt it for use with KDPLUS
and the KDPLUS editor, JWRITE. The most important of those changes is that the
IBM control code symbols (corresponding to ASCII values below 32) have been
added to row 14 of the font, from position 11, and the IBM characters with
values EB-FE are in that same row from position 75. Furthermore, the character
ASCII 92 (5C hex), corresponding to ku=9, ten=60, is now displayed
as a backslash, to make it conform to normal IBM PC usage (in the original
KDP16SJ.FNT, as on most Japanese computer systems, this character is a "yen"
sign). Also, some cosmetic changes have been made in the "tilde", apostrophe,
reverse apostrophe, and quotation mark symbols, to make them useable as
accents. The "equals" sign (=) has also been slightly modified. In combination
with the capital Y, it makes a nice "yen" sign (through the accent facility of
JWRITE), should you need it.
If you don't like these changes, you can undo them using the font editor
KFEDIT that comes with KDPLUS.
Tokyo, 10 July 1991 (revised 14 January 1992, 16 February 1992, 20 May 1992)
Jan W. Stumpel